Project Template¶
The following template repeats the essential steps for scraping, tokenizing, and mapping sentiment data. Each major step is further broken down into smaller steps. You can step through most of the code.
- You will have to create your own dataframe names
- Customize the filters
- Customize the maps
Note¶
When your individual input is required the following symbol will appear:

You should save your results to a pickle file after each major step. These points are indicated by the save icon:

2 Data Input and Wrangling¶
Introduction¶
In this step, you will load the cleaned Project Gutenberg catalog, filter it down to a custom corpus, and save the result locally for further analysis.
Process Steps:¶
- Load the cleaned catalog from a pickle file
- Create and save a custom dataframe
Step-by-Step Guide¶
Step 1: Import the Required Libraries¶
The primary library for handling CSV data in Python is pandas. Make sure it’s imported at the beginning of your script.
import pandas as pd
Step 2: Import Pickle File¶
In lesson 2_1 we created a clean pickle file of the catalog, so we can simply start there rather than running through all those steps.
pg_catalog_clean = pd.read_pickle('pg_catalog_clean.pickle')

Step 3: Create a custom dataframe¶
Overview¶
In lesson 2_2 we learned how to filter dataframes. For this part of the project, you are going to create your own pg_catalog dataframe that includes the corpus of works you want to analyze. Your corpus can be as large as you want, but keep in mind that having a lot of text can significantly increase your processing time.
Requirements¶
Please observe the following parameters when creating your custom dataframe:
- Use a unique, descriptive dataframe name (i.e. not df_virginia_history)
- The corpus should have a logical coherence
- Aim for at least 25 texts
- Save the resulting dataframe as a pickle file
Note¶
Remember to make a deep copy of your dataframe by using the method .copy()
Example¶
df_virginia_history = df_pg_catalog[
    (df_pg_catalog.language == 'en') &
    (df_pg_catalog.type == 'Text')
].copy()
Save the dataframe to a pickle file:
df_virginia_history.to_pickle('custom_file.pickle')
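Why the deep copy in the Note above matters: filtering returns a slice of the original dataframe, and writing to that slice without `.copy()` can trigger pandas' SettingWithCopyWarning or silently fail to stick. A minimal sketch with invented data:

```python
import pandas as pd

# Invented toy catalog, just to show the behavior of .copy()
df = pd.DataFrame({'language': ['en', 'fr', 'en'], 'type': ['Text', 'Text', 'Sound']})

subset = df[(df.language == 'en') & (df.type == 'Text')].copy()  # independent copy
subset['keep'] = True  # safe: modifying the copy leaves df untouched
print('keep' in df.columns)  # False
```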
3 Scraping Gutenberg¶
Overview¶
In the previous section you created a custom dataframe. Here you are going to scrape the texts for that dataframe. The scraping function assumes that your first column is called text_id; if that is not the case, something went wrong with the filtering process.
Load gutenberg_scraper¶
Since you only need one function, fetch_text_data(), all of the other logic has been tucked away in a .py file. You can import the function like any other import, as long as the file is present in your root directory.
from gutenberg_scraper import fetch_text_data
#This line is here to let you walk through the code. It takes a random sample from the Project Gutenberg Catalog and runs it through all the steps.
YOUR_DATAFRAME = pg_catalog_clean.sample(10)
fetch_text_data(YOUR_DATAFRAME)
100%|██████████| 10/10 [00:08<00:00, 1.16it/s]
| text_id | type | issued | title | language | subjects | locc | bookshelves | second_author | last_name | first_name | birth | death | text_data | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 42709 | 42783 | Text | 2013-05-24 | Notes and Queries, Number 168, January 15, 185... | en | Questions and answers -- Periodicals | AG | Notes and Queries; Browsing: Culture/Civilizat... | Bell, George, 1814-1890 [Editor] | Various | <NA> | <NA> | <NA> | NOTES AND QUERIES, NUMBER 168, JANUARY 15, 185... |
| 29725 | 29797 | Text | 2009-08-25 | The Prison Chaplaincy, And Its Experiences | en | Prisons -- New Hampshire; Prisoners -- Religio... | HV | Browsing: History - American; Browsing: Religi... | <NA> | Quinby | Hosea | 1804 | 1878 | THE PRISON CHAPLAINCY, AND ITS EXPERIENCES ***... |
| 60350 | 60427 | Text | 2019-10-05 | Introducing Irony: A Book of Poetic Short Stor... | en | American poetry -- 20th century; Short stories... | PS | Browsing: Literature; Browsing: Poetry; Browsi... | <NA> | Bodenheim | Maxwell | 1893 | 1954 | INTRODUCING IRONY: A BOOK OF POETIC SHORT STOR... |
| 68907 | 69000 | Text | 2022-09-16 | The private life of Henry Maitland: A record d... | en | Gissing, George, 1857-1903 | PR | Browsing: Biographies; Browsing: Literature | <NA> | Roberts | Morley | 1857 | 1942 | THE PRIVATE LIFE OF HENRY MAITLAND: A RECORD D... |
| 43403 | 43477 | Text | 2013-08-15 | Bell's Cathedrals: The Cathedral Church of Lin... | en | Lincoln Cathedral | <NA> | Browsing: Architecture; Browsing: Art & Photog... | <NA> | Kendrick | A. F. (Albert Frank) | 1872 | 1954 | BELL'S CATHEDRALS: THE CATHEDRAL CHURCH OF LIN... |
| 40084 | 40158 | Text | 2012-07-07 | Manners: A Novel, Vol 1 | en | England -- Social life and customs -- Fiction | PR | Browsing: Culture/Civilization/Society; Browsi... | <NA> | Panache | Madame | <NA> | <NA> | MANNERS: A NOVEL, VOL 1 ***\r\n\r\n\r\n\r\n\r\... |
| 33998 | 34071 | Text | 2010-10-15 | The Evolution of an Empire: A Brief Historical... | en | France -- History | DC | Browsing: History - European; Browsing: Histor... | <NA> | Parmele | Mary Platt | 1843 | 1911 | *** |
| 4898 | 4939 | Text | 2004-01-01 | Chopin : the Man and His Music | en | Chopin, Frédéric, 1810-1849; Composers -- Biog... | ML | Browsing: Biographies; Browsing: Music | <NA> | Huneker | James | 1857 | 1921 | *** |
| 477 | 496 | Text | 2006-01-16 | The Little Lame Prince | en | Children's literature; Fairy tales; Princes --... | PZ | Children's Literature; Browsing: Children & Yo... | <NA> | Craik | Dinah Maria Mulock | 1826 | 1887 | THE LITTLE LAME PRINCE ***\r\n\r\n\r\n\r\n\r\n... |
| 37937 | 38011 | Text | 2011-11-14 | The Epic of Hades, in Three Books | en | Gods, Greek -- Poetry; Mythology, Greek -- Poetry | PR | Browsing: Culture/Civilization/Society; Browsi... | <NA> | Morris | Lewis | 1833 | 1907 | THE EPIC OF HADES, IN THREE BOOKS ***\r\n\r\n\... |
Export Pickle 1¶

Export the file as a pickle file for presentation.
YOUR_DATAFRAME.to_pickle('YOUR_DATAFRAME_TEXTS.pickle')

4 Clean DataFrame for Analysis¶
To prepare our data for analysis we will:
- Split it into sentences
- Clean the individual sentences
- Drop unnecessary data
For the code below you will have to replace YOUR_DATAFRAME with the name of your dataframe.
Import NLTK¶
We can use NLTK for some basic preprocessing.
import nltk
import re
# If sent_tokenize raises a LookupError, download the tokenizer data once: nltk.download('punkt')
Step 1: Tokenize Text into Sentences¶
# Explode the DataFrame so that each row corresponds to a single sentence
YOUR_DATAFRAME = YOUR_DATAFRAME.assign(
    sentences=YOUR_DATAFRAME['text_data'].apply(nltk.sent_tokenize)
).explode('sentences')
Step 2: Remove the 'text_data' column¶
YOUR_DATAFRAME = YOUR_DATAFRAME.drop(columns='text_data')
Step 3: Define a Cleaning Function for Sentences¶
def clean_sentence(sentence):
    # 1. Remove text inside square brackets
    sentence = re.sub(r'\[.*?\]', '', sentence)
    # 2. Remove unwanted punctuation but retain sentence-ending punctuation
    sentence = re.sub(r'[^\w\s,.!?\'"‘’“”`]', '', sentence)
    # 3. Remove newline and carriage return characters, and underscores
    sentence = sentence.replace('\n', ' ').replace('\r', ' ').replace('_', '')
    # 4. Return an empty string for all-uppercase sentences (likely headers or TOC entries)
    return '' if sentence.isupper() else sentence
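A quick sanity check of the cleaning function on invented sample strings (the function is repeated so the cell runs on its own):

```python
import re

def clean_sentence(sentence):
    sentence = re.sub(r'\[.*?\]', '', sentence)              # 1. bracketed editorial notes
    sentence = re.sub(r'[^\w\s,.!?\'"‘’“”`]', '', sentence)  # 2. stray punctuation
    sentence = sentence.replace('\n', ' ').replace('\r', ' ').replace('_', '')  # 3. line breaks, underscores
    return '' if sentence.isupper() else sentence            # 4. drop all-caps headers

print(clean_sentence('The _Times_ [sic] reported;\r\nfrom London.'))
print(repr(clean_sentence('CHAPTER I')))  # all-caps header becomes ''
```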
Step 4: Apply Cleaning and Remove Empty Sentences¶
# Apply the cleaning function, then filter out any sentences that are empty strings
YOUR_DATAFRAME['cleaned_sentences'] = YOUR_DATAFRAME['sentences'].apply(clean_sentence)
YOUR_DATAFRAME = YOUR_DATAFRAME[YOUR_DATAFRAME['cleaned_sentences'] != '']
Step 5: Reset Index for the Cleaned DataFrame¶
YOUR_DATAFRAME = YOUR_DATAFRAME.reset_index(drop=True)
Step 6: (OPTIONAL) Save deep copy of dataframe and pickle¶
#YOUR_DATAFRAME_DEEP = YOUR_DATAFRAME.copy()
#YOUR_DATAFRAME_DEEP.to_pickle('YOUR_DATAFRAME_DEEP.pickle')
5 Perform Initial Tokenization¶
Overview¶
Since the geoparsing process is quite intense, we can reduce our processing overhead by eliminating those sentences that likely don't contain toponyms. We do so by first running a pass with the lightweight en_core_web_sm spaCy model.
Load Spacy¶
We are going to load spaCy and the small model at the same time.
import spacy
from tqdm import tqdm
tqdm.pandas()
# Load spaCy's English model
nlp = spacy.load('en_core_web_sm')
Load Functions into memory¶
# Function to extract GPE (Geopolitical Entities) from a batch of docs
def extract_gpe_from_docs(docs):
    return [[ent.text for ent in doc.ents if ent.label_ == 'GPE'] or None for doc in docs]

# Use nlp.pipe() for faster batch processing with multiple cores
def process_sentences_in_batches(sentences, batch_size=50, n_process=-1):
    # Process sentences using nlp.pipe with batch processing and multi-processing
    gpe_results = []
    for doc in tqdm(nlp.pipe(sentences, batch_size=batch_size, n_process=n_process), total=len(sentences)):
        gpes = [ent.text for ent in doc.ents if ent.label_ == 'GPE']
        gpe_results.append(gpes if gpes else None)
    return gpe_results
Process your DataFrame¶
YOUR_DATAFRAME['toponyms'] = process_sentences_in_batches(YOUR_DATAFRAME['cleaned_sentences'])

# Display the column that was just processed
YOUR_DATAFRAME['cleaned_sentences']
100%|██████████| 14564/14564 [01:40<00:00, 145.05it/s]
0 NOTES AND QUERIES, NUMBER 168, JANUARY 15, 185...
1 "When found, make a note of."
2
3 No.
4 168.
...
14559 See the Orphic Hymns.
14560
14561 Transcriber's Notes This text is hemistichi...
14562 Inconsistent Hyphenation and text retained.
14563
Name: cleaned_sentences, Length: 14564, dtype: object
Clean up the result¶
As we saw in lesson_5, the deep scan is the most computationally intense part of the extraction process, so we want to reduce the number of sentences being processed to lower the computation time. We can do three things at this stage.
- Eliminate unnecessary columns
- Eliminate all sentences for which there is no result
- Eliminate all sentences with very few results. Your group can decide on the threshold, but suffice it to say that toponyms with a count of 1 won't be relevant. You can adjust this number as you fine-tune your model.
Eliminate Unnecessary Columns¶
YOUR_DATAFRAME = YOUR_DATAFRAME.drop(columns=['language', 'issued', 'type', 'locc', 'bookshelves', 'second_author']).copy()
Eliminate None¶
YOUR_DATAFRAME = YOUR_DATAFRAME[YOUR_DATAFRAME.toponyms.notna()]
To simplify some of the more complicated processing later on, the function below adds a count column to the dataframe.
def add_toponym_count(df, toponym_col='toponyms', sentence_col='cleaned_sentences'):
    """
    Processes the DataFrame to count toponyms and aggregate back to sentences, keeping all original columns.

    Args:
        df (pd.DataFrame): The DataFrame containing toponyms and sentences.
        toponym_col (str): Column containing the toponyms as lists.
        sentence_col (str): Column containing the cleaned sentences.

    Returns:
        pd.DataFrame: A DataFrame grouped by sentences with a list of toponyms, their counts, and all original columns.
    """
    # Step 1: Explode the 'toponyms' column
    exploded_df = df.explode(toponym_col)

    # Step 2: Group by 'toponyms' to count occurrences and add 'nltk_toponym_count' column
    toponym_counts = exploded_df.groupby(toponym_col).size().reset_index(name='nltk_toponym_count')

    # Step 3: Merge the counts back to the exploded DataFrame
    exploded_df = exploded_df.merge(toponym_counts, on=toponym_col, how='left')

    # Step 4: Group by 'cleaned_sentences' and aggregate all columns.
    # Use 'first' to retain the first non-null value for each original column, and a list for the toponym_col
    aggregation_dict = {col: 'first' for col in df.columns if col not in [sentence_col, toponym_col]}
    aggregation_dict[toponym_col] = lambda x: list(x)  # Aggregate toponyms into lists
    aggregation_dict['nltk_toponym_count'] = 'first'   # Take the first count (all counts are the same within groups)

    result_df = exploded_df.groupby(sentence_col).agg(aggregation_dict).reset_index()
    return result_df
YOUR_DATAFRAME = add_toponym_count(YOUR_DATAFRAME)
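If you want to convince yourself of what the function does, you can run it on a tiny invented dataframe (the function is repeated here so the cell is self-contained):

```python
import pandas as pd

def add_toponym_count(df, toponym_col='toponyms', sentence_col='cleaned_sentences'):
    # Same logic as above: explode, count per toponym, merge back, re-group per sentence
    exploded_df = df.explode(toponym_col)
    toponym_counts = exploded_df.groupby(toponym_col).size().reset_index(name='nltk_toponym_count')
    exploded_df = exploded_df.merge(toponym_counts, on=toponym_col, how='left')
    aggregation_dict = {col: 'first' for col in df.columns if col not in [sentence_col, toponym_col]}
    aggregation_dict[toponym_col] = lambda x: list(x)
    aggregation_dict['nltk_toponym_count'] = 'first'
    return exploded_df.groupby(sentence_col).agg(aggregation_dict).reset_index()

# Invented toy data: 'Paris' appears twice across the corpus, 'Rome' once
toy = pd.DataFrame({
    'cleaned_sentences': ['I saw Paris.', 'Paris and Rome.'],
    'toponyms': [['Paris'], ['Paris', 'Rome']],
    'title': ['Book A', 'Book B'],
})
out = add_toponym_count(toy)
print(out[['cleaned_sentences', 'nltk_toponym_count']])
```

Note that the count attached to each sentence is the corpus-wide count of its first toponym, which is what the later threshold filter relies on.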

Filter out low toponym counts¶
Your dataframe now has the new variable nltk_toponym_count. You can filter out low-count results to get fewer sentences, e.g. a dataframe of all cleaned sentences where nltk_toponym_count is greater than 1.
#Remove toponyms with a low count
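One possible sketch of this filter, shown on an invented stand-in dataframe (use your own dataframe name and adjust the threshold as your group decides):

```python
import pandas as pd

# Invented stand-in for YOUR_DATAFRAME after add_toponym_count
df = pd.DataFrame({
    'cleaned_sentences': ['a', 'b', 'c'],
    'nltk_toponym_count': [1, 4, 2],
})

# Keep only sentences whose toponym count is greater than 1
df_toponym = df[df['nltk_toponym_count'] > 1].copy()
print(len(df_toponym))  # 2
```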

(Optional) Save pickle of tokenization¶
#YOUR_DATAFRAME_TOPONYM.to_pickle('YOUR_DATAFRAME_TOPONYM.pickle')
6 Geoparsing (Deep Scan)¶
Overview¶
Since the deep scan for toponyms will likely reduce the size of the dataframe again, we can backload the sentiment analysis as the last step to ensure we don't process data unnecessarily.
from geoparser import Geoparser
from tqdm.notebook import tqdm
Because there are some compatibility issues with the geoparser package, there are pesky warnings that pop up. These do not affect the output, but they are annoying. The lines below filter them out of the console.
import warnings
# Suppress all FutureWarnings
warnings.simplefilter(action='ignore', category=FutureWarning)
Load Geoparser¶
To use Geoparser, instantiate an object of the Geoparser class with optional specifications for the spaCy model, transformer model, and gazetteer. By default, the library uses an accuracy-optimised configuration:
geo = Geoparser(spacy_model='en_core_web_trf', transformer_model='dguzh/geo-all-distilroberta-v1', gazetteer='geonames')
Load in the geoparse_column function to simplify the toponym recognition process.
def geoparse_column(df):
    sentences = df['cleaned_sentences'].tolist()  # Convert column to list
    docs = geo.parse(sentences, feature_filter=['A', 'P'])  # Run geo.parse on the entire list

    # Initialize lists to store the extracted fields
    places, latitudes, longitudes, feature_names = [], [], [], []

    # Iterate through the results and extract toponyms and their locations
    for doc in docs:
        doc_places = []
        doc_latitudes = []
        doc_longitudes = []
        doc_feature_names = []
        for toponym in doc.toponyms:
            if toponym.location:
                doc_places.append(toponym.location.get('name'))
                doc_latitudes.append(toponym.location.get('latitude'))
                doc_longitudes.append(toponym.location.get('longitude'))
                doc_feature_names.append(toponym.location.get('feature_name'))
            else:
                doc_places.append(None)
                doc_latitudes.append(None)
                doc_longitudes.append(None)
                doc_feature_names.append(None)
        # Append the extracted data for the document
        places.append(doc_places)
        latitudes.append(doc_latitudes)
        longitudes.append(doc_longitudes)
        feature_names.append(doc_feature_names)

    # Assign the extracted data to the DataFrame as new columns
    df['place'] = places
    df['latitude'] = latitudes
    df['longitude'] = longitudes
    df['feature_name'] = feature_names
    return df
geoparse_column(YOUR_DATAFRAME)
Toponym Recognition...
Batches: 0%| | 0/1388 [00:00<?, ?it/s]
Toponym Resolution...
Batches: 0%| | 0/873 [00:00<?, ?it/s]
Batches: 0%| | 0/158 [00:00<?, ?it/s]
| cleaned_sentences | text_id | title | subjects | last_name | first_name | birth | death | sentences | toponyms | nltk_toponym_count | place | latitude | longitude | feature_name | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 76 Now re... | 42783 | Notes and Queries, Number 168, January 15, 185... | Questions and answers -- Periodicals | Various | <NA> | <NA> | <NA> | * * * * *\r\n\r\n\r\n{... | [pp] | 4 | [] | [] | [] | [] |
| 1 | ARCHER'S PHO... | 42783 | Notes and Queries, Number 168, January 15, 185... | Questions and answers -- Periodicals | Various | <NA> | <NA> | <NA> | * * * * *\r\n\r\n\r\nA... | [MR] | 3 | [] | [] | [] | [] |
| 2 | TO PHOTOGRAP... | 42783 | Notes and Queries, Number 168, January 15, 185... | Questions and answers -- Periodicals | Various | <NA> | <NA> | <NA> | * * * * *\r\n\r\n\r\nT... | [Brébisson] | 1 | [] | [] | [] | [] |
| 3 | If you have received the various lett... | 40158 | Manners: A Novel, Vol 1 | England -- Social life and customs -- Fiction | Panache | Madame | <NA> | <NA> | ----\r\n\r\n If you have received the vari... | [England] | 109 | [England, Rhodes, Algiers] | [52.16045, 53.54368, 36.73225] | [-0.70312, -2.22551, 3.08746] | [first-order administrative division, populate... |
| 4 | George Drew, Esq. | 42783 | Notes and Queries, Number 168, January 15, 185... | Questions and answers -- Periodicals | Various | <NA> | <NA> | <NA> | ;\r\n George Drew, Esq. | [Esq] | 22 | [] | [] | [] | [] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1383 | the old world Is full of Hate. | 38011 | The Epic of Hades, in Three Books | Gods, Greek -- Poetry; Mythology, Greek -- Poetry | Morris | Lewis | 1833 | 1907 | the old world\r\nIs full of Hate. | [Hate] | 1 | [] | [] | [] | [] |
| 1384 | vi., pp. | 42783 | Notes and Queries, Number 168, January 15, 185... | Questions and answers -- Periodicals | Various | <NA> | <NA> | <NA> | vi., pp. | [pp] | 4 | [] | [] | [] | [] |
| 1385 | was written, one of these fabrics of the olden... | 42783 | Notes and Queries, Number 168, January 15, 185... | Questions and answers -- Periodicals | Various | <NA> | <NA> | <NA> | was written, one of these fabrics of the olden... | [Kington] | 1 | [] | [] | [] | [] |
| 1386 | what fine pearls you have got onyour mother's... | 40158 | Manners: A Novel, Vol 1 | England -- Social life and customs -- Fiction | Panache | Madame | <NA> | <NA> | what\r\nfine pearls you have got on--your _mot... | [suppose, Miss.] | 1 | [] | [] | [] | [] |
| 1387 | xi., ver. | 43477 | Bell's Cathedrals: The Cathedral Church of Lin... | Lincoln Cathedral | Kendrick | A. F. (Albert Frank) | 1872 | 1954 | xi., ver. | [xi.] | 1 | [Budapest XI.] | [47.475] | [19.03605] | [second-order administrative division] |
1388 rows × 15 columns

Export Pickle 2¶
As the geoparsing process takes a long time, you should store the result right away. You will also import these results for your project_presentation_template.
YOUR_DATAFRAME.to_pickle('YOUR_DATAFRAME_PLACES.pickle')
Clean up the resulting dataframe¶
As with the previous instance of toponym resolution, there will be some rows that do not contain relevant information. This will slow down the sentiment analysis.
- Eliminate empty results
YOUR_DATAFRAME = YOUR_DATAFRAME[YOUR_DATAFRAME['place'].str.len() != 0].copy()
7 Run Sentiment Analysis¶
Overview¶
We will now implement the sentiment analysis on the remaining sentences.
Step through the cells below and read all the prerequisites into memory.
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
from tqdm import tqdm
from scipy.special import softmax
from typing import Dict, Any
# Initialize RoBERTa. There will probably be a warning. You can ignore this.
MODEL = "cardiffnlp/twitter-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
# Function to calculate RoBERTa sentiment scores
def polarity_scores_roberta(text: str) -> Dict[str, float]:
    """
    Calculate RoBERTa sentiment scores for a given text.

    Args:
        text: The text to analyze

    Returns:
        A dictionary with sentiment scores for negative, neutral, and positive sentiment
    """
    # Tokenize and truncate to max length (512 tokens)
    encoded_text = tokenizer.encode_plus(
        text,
        max_length=512,
        truncation=True,
        return_tensors='pt'
    )
    # Get model output and convert to probabilities
    output = model(**encoded_text)
    scores = output[0][0].detach().numpy()
    scores = softmax(scores)
    return {
        'roberta_neg': scores[0],
        'roberta_neu': scores[1],
        'roberta_pos': scores[2]
    }
# Function to attach sentiment analysis to a specific column in the dataframe
def add_sentiment_to_column(
    df: pd.DataFrame, column_name: str, num_rows: int = None
) -> pd.DataFrame:
    """
    Adds RoBERTa sentiment analysis to a specified column in a dataframe.

    Args:
        df: The dataframe to process
        column_name: The name of the column containing the text to analyze
        num_rows: The number of rows to process (default: all rows)

    Returns:
        df: A dataframe with added sentiment analysis columns
    """
    # If num_rows is specified, limit the dataframe, otherwise process all rows
    if num_rows:
        df_subset = df.head(num_rows).reset_index(drop=True)
    else:
        df_subset = df.reset_index(drop=True)  # Process all rows and reset the index

    # Function to process each row and add sentiment analysis
    def process_row(text: str) -> Dict[str, Any]:
        try:
            return polarity_scores_roberta(text)
        except Exception as e:
            print(f"Error processing text: {text}. Error: {e}")
            return {'roberta_neg': None, 'roberta_neu': None, 'roberta_pos': None}

    # Apply the RoBERTa sentiment analysis to each row
    tqdm.pandas(desc="Processing Sentiment Analysis")
    sentiment_scores = df_subset[column_name].progress_apply(process_row)

    # Convert the resulting list of dictionaries into a DataFrame and concatenate it with the original subset
    sentiment_df = pd.DataFrame(sentiment_scores.tolist())
    df_subset = pd.concat([df_subset, sentiment_df], axis=1)
    return df_subset
YOUR_DATAFRAME = add_sentiment_to_column(YOUR_DATAFRAME, 'cleaned_sentences')
Processing Sentiment Analysis: 100%|██████████| 788/788 [01:04<00:00, 12.27it/s]
Create an aggregate score¶
Since RoBERTa returns separate positive, negative, and neutral scores, we will consolidate them into one easier-to-understand score. We take the difference between positive and negative and multiply it by one minus the neutral score. This way, if a sentence is scored as very neutral, the difference between positive and negative is damped out.
# Calculate the compound score and add it as a new column 'roberta_compound'
YOUR_DATAFRAME['roberta_compound'] = (
    YOUR_DATAFRAME['roberta_pos'] - YOUR_DATAFRAME['roberta_neg']
) * (1 - YOUR_DATAFRAME['roberta_neu'])
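A quick arithmetic check of the formula, using made-up scores for a fairly positive, somewhat neutral sentence:

```python
# Hypothetical RoBERTa scores (they sum to 1)
pos, neg, neu = 0.6, 0.1, 0.3
compound = (pos - neg) * (1 - neu)
print(round(compound, 2))  # (0.5) * (0.7) -> 0.35
```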
Explode, filter, and aggregate¶
At the moment, there are places and sentiments, but since some of the sentences contain multiple places these need to be unnested.
YOUR_DATAFRAME = YOUR_DATAFRAME.explode(['place', 'latitude', 'longitude', 'feature_name'])
Remove empty values
YOUR_DATAFRAME = YOUR_DATAFRAME[YOUR_DATAFRAME.place.notna()]
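The explode-then-filter pattern can be sketched on invented data (note that pandas requires the exploded list columns to have matching lengths per row):

```python
import pandas as pd

# Toy rows: one sentence with two resolved places, one with none
df = pd.DataFrame({
    'sentence': ['Paris and Rome.', 'No places here.'],
    'place': [['Paris', 'Rome'], [None]],
    'latitude': [[48.85, 41.89], [None]],
})
df = df.explode(['place', 'latitude'])  # one row per place
df = df[df.place.notna()]               # drop unresolved rows
print(df['place'].tolist())  # ['Paris', 'Rome']
```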
Aggregate the data
YOUR_DATAFRAME = YOUR_DATAFRAME.groupby('place').agg(
    location_count=('place', 'size'),          # Count occurrences of each location
    latitude=('latitude', 'first'),            # Take the first latitude (you can also use 'mean')
    longitude=('longitude', 'first'),          # Take the first longitude (or 'mean')
    location=('feature_name', 'first'),
    avg_roberta_pos=('roberta_pos', 'mean'),   # Average of roberta_pos
    avg_roberta_neu=('roberta_neu', 'mean'),   # Average of roberta_neu
    avg_roberta_neg=('roberta_neg', 'mean'),   # Average of roberta_neg
    avg_roberta_compound=('roberta_compound', 'mean')
).reset_index()
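The named-aggregation pattern above can be checked on a toy dataframe with invented scores:

```python
import pandas as pd

# Invented per-sentence results for two places
df = pd.DataFrame({
    'place': ['Paris', 'Paris', 'Rome'],
    'roberta_compound': [0.2, 0.4, -0.1],
})
agg = df.groupby('place').agg(
    location_count=('place', 'size'),                  # how often each place occurs
    avg_roberta_compound=('roberta_compound', 'mean'), # mean sentiment per place
).reset_index()
print(agg.to_string(index=False))
```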
Create Histogram of Count Values (Optional)¶
To get a sense of how the data is distributed and to decide which data to include, you can create a histogram of the location_count column fairly easily.
import matplotlib.pyplot as plt
# You might need to install matplotlib with
# pip install matplotlib
YOUR_DATAFRAME.location_count.plot.hist(bins=10, alpha=0.7)
<Axes: ylabel='Frequency'>
Generally, the data will be heavily skewed, with most locations appearing only a few times. You might want to filter out some of the lower values.
Filter out low counts¶
As very low counts will not show up on the map anyway, filter them out here. No code has been provided, but the procedure is essentially the same as before.
#code here
Bucket Data¶
As we saw in lesson_5, the distribution of the data is tricky. We can solve this by bucketing it along the lines of Jenks Natural Breaks.
# You may get an ImportError. If so, install mapclassify with: pip install mapclassify
import mapclassify as mc
jenks_breaks = mc.NaturalBreaks(y=YOUR_DATAFRAME['location_count'], k=5)
YOUR_DATAFRAME.loc[:,'location_count_bucket'] = jenks_breaks.find_bin(YOUR_DATAFRAME['location_count'])+1
Export Pickle 3¶
This is the final export of the file for the project_presentation_template.
YOUR_DATAFRAME.to_pickle('YOUR_DATAFRAME_SENTIMENTS.pickle')
Map your Data¶
Overview¶
This is the core of the project. Use the stub below to map your data and then customize the map. I have deliberately set some of the values very poorly to encourage you to work on your own map!
import plotly.express as px
fig = px.scatter_mapbox(
YOUR_DATAFRAME, #put your dataframe here
lat="latitude", # Latitude column
lon="longitude", # Longitude column
size="location_count_bucket", # Bubble size based on location count
color="avg_roberta_compound", # Color based on sentiment score
color_continuous_scale=px.colors.cyclical.Twilight[::-1], # Use Twilight scale (blue to red)
size_max=30, # Maximum size of the bubbles
center={"lat": 48, "lon": 2},
zoom=6 # Adjust zoom level for better visibility
)
# Update the layout to use the default map style (which doesn't need a token)
fig.update_layout(
mapbox_style="open-street-map", # No token needed for this style
margin={"r":0,"t":0,"l":0,"b":0} # Remove margins for a cleaner view
)
fig.show()
Happy mapping!